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Abstract 

Background: Through transcription and alternative splicing, a gene can be transcribed into different RNA 
sequences (isoforms), depending on the individual, on the tissue the cell is in, or in response to some stimuli. 
Recent RNA-Seq technology allows for new high-throughput ways for isoform identification and quantification 
based on short reads, and various methods have been put forward for this non-trivial problem. 

Results: In this paper we propose a novel radically different method based on minimum-cost network flows. This 
has a two-fold advantage: on the one hand, it translates the problem as an established one in the field of network 
flows, which can be solved in polynomial time, with different existing solvers; on the other hand, it is general 
enough to encompass many of the previous proposals under the least sum of squares model. Our method works 
as follows: in order to find the transcripts which best explain, under a given fitness model, a splicing graph 
resulting from an RNA-Seq experiment, we find a min-cost flow in an offset flow network, under an equivalent cost 
model. Under very weak assumptions on the fitness model, the optimal flow can be computed in polynomial time. 
Parsimoniously splitting the flow back into few path transcripts can be done with any of the heuristics and 
approximations available from the theory of network flows. In the present implementation, we choose the simple 
strategy of repeatedly removing the heaviest path. 

Conclusions: We proposed a new very general method based on network flows for a multiassembly problem 
arising from isoform identification and quantification with RNA-Seq. Experimental results on prediction accuracy 
show that our method is very competitive with popular tools such as Cufflinks and IsoLasso. Our tool, called Traph 
(Transcrips in gRAPHs), is available at: http://www.cs.helsinki.fi/gsa/traph/. 



Background 

Recent RNA-Seq technology [1,2] opened a new high- 
throughput, low cost way for isoform identification and 
quantification, leading to new understanding of gene 
regulation in development and disease (e.g., [3]). In an 
RNA-Seq experiment a set of short reads is produced 
from mRNA transcripts. The difficulty in assembling 
these short reads into the transcripts from which they 
were sampled is non-trivial due to the fact that the tran- 
scripts (isoforms) may share exons. As a result, all meth- 
ods for solving this problem rely on an explicit or 
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implicit graph model. The nodes represent individual 
reads (overlap graph [4]), or contiguous stretches of 
DNA uninterrupted by spliced reads (splicing graph 
[5-7], connectivity graph [8-10]), while the edges are 
derived from overlaps or from spliced read alignments. 
Each node and edge has an associated observed cover- 
age, and the problem of isoform identification and 
quantification is seen as separating the coverage of the 
graph into individual path components, under different 
models. Furthermore, this problem was also coined 
under the broad name 'Multiassembly Problem' [11], a 
hint that it can arise not only with RNA-Seq data, but 
also in other biological settings, such as assembling 
metagenomics reads [12]. 
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Except for Cufflinks [4], all tools mentioned above rely 
on some optimization engine, whose solving is generally 
difficult. IsoInfer/IsoLasso [8,9], SLIDE [7], Scripture [10], 
and CLIIQ [6] exhaustively enumerate all possible candi- 
date paths. For efficiency reasons, each has some restric- 
tions on what a valid candidate path might be, and for 
each candidate isoform, they define a fitness function. 
IsoInfer/IsoLasso and SLIDE use a least sum of squares fit- 
ness function; IsoLasso and SLIDE both add different 
shrinkage terms to the fitness function in order to favor 
isoforms with fewer transcripts, which is computed with a 
modified LASSO algorithm, or a quadratic program; 
CLIIQ uses a least sum of absolute differences fitness 
function, solved by a linear integer program. Cufflinks 
avoids the problem of exhaustively enumerating all possi- 
ble paths by returning a minimum path cover, and then 
assigning expression levels to each path in this cover 
based on a statistical model. Incidentally, note that com- 
puting a minimum path cover (in an acyclic digraph) is 
done by computing a maximum matching, which can be 
easily reduced to a flow problem. However, such reduction 
solves a different (implicitiy defined) optimization problem 
that can be considered as a consensus model in the litera- 
ture [6-10], mostly because the fitting of expression levels 
is separated in the process. 

Our contribution 

In this paper we propose a radically different and very 
general method relying on the established field of mini- 
mum-cost network flow problems [13]. This will not 
only provide a simple method and a fast polynomial 
time algorithm for solving it (as opposed to exhaustively 
enumerating all possible candidate paths, and then sol- 
ving a quadratic/integer linear program for evaluating 
the fitness of each candidate isoform), but it can also 
lean on the ample literature on splitting a (min-cost) 
flow into paths, e.g., [14-17]. 

As in the case of the other tools, our method assumes 
that a splicing graph has been built for each gene. Each 
node of the graph corresponds to a stretch of DNA unin- 
terrupted by any spliced read alignment; such sequences 
are called segments in [9], but for simplicity we just call 
them exons. Each edge of the graph corresponds to two 
exons consecutive in some transcript, that is, to some 
spliced read whose prefix aligns to the suffix of one exon, 
and whose suffix aligns to the prefix of another exon. 
Observe that such a graph can be seen as a directed acyc- 
lic graph (DAG, for short), the direction of the edges 
being according to the absolute position of the exons in 
the genome. For each exon v we can deduct its coverage 
cov(v) as the total number of reads aligned to the exon 
divided by the exon length, and the coverage cov(u, v) of 
an edge (u, v) as the total number of reads split aligned 
to that junction between exons u and v. An mRNA 



transcript candidate thus becomes a path from some 
source node to some sink node. The requirement that 
the transcripts start in a source node and end in a sink 
node is no restriction, as we can add dummy source/sink 
nodes as in-/out-neighbors to the nodes where we have 
indication that some transcript might start/end. Indeed, 
our splicing graph creation tool uses splicing alignments 
and coverage information to discover such start/end 
nodes and accordingly indicates them to our tool. 

In order to define a fitness function in the broadest 
possible terms, let us assume that for each node v and 
edge (w, v) of the graph we have convex cost functions 
fv,fuv : K — >• R. modeling how close that node and edge 
must be explained by the candidate isoform. Then, we 
can state the problem of isoform identification and 
quantification as following problem. 

Problem 1 (UTEC) Given a splicing DAG G = (V, E) 
with coverage values cov(v) and cov(u, v), and cost func- 
tions f v {-) and f U v{')t for all v e V and (u, v) e E, the 
Unannotated Transcript Expression Cover problem is to 
find a tuple V of paths from the sources of G to the sinks 
of G, with an estimated expression level e(P) for each 
path PeP, which minimize 



sum_err(C, V) 



•54 



«M- E < p ) 



PzV : (U,v) € P 



For example, if for all nodes v and edges {u, v),f v {x) = x, 
fuv{x) = x, then we have a least sum of absolute differ- 
ences model as in CLIIQ. lif u (x) = x 2 ,f uv (x) = x 2 , then we 
have a least sum of squares model as in IsoInfer/IsoLasso 
and SLIDE; this is the model which we also use in the 
implementation reported in this paper. Another cost 
function, suggested by [18], is /„(x) = x/^/cov(v), 
fuv(x) = x/tJcov(u, v) for all nodes v and edges [u, v). 
Observe that many of the other biological assumptions of 
the other tools can be incorporated in the model of Pro- 
blem UTEC. 

We will show that Problem UTEC can be solved in 
polynomial time, by a reduction to a min-cost flow pro- 
blem with convex cost functions. We will argue that find- 
ing the optimal tuple of paths explaining the graph is 
equivalent to finding the optimal flow in an offset flow 
network. Moreover, any splitting of this optimal flow into 
paths attains the minimum of Problem UTEC. In the 
same way as some of the other tools try to limit the num- 
ber of paths explaining a splicing graph by a LASSO 
approach, we can rely on established methods for split- 
ting any flow into few paths (e.g., [14-17]). In this paper, 
we employ only the simple linear-time heuristic of 
repeatedly removing the heaviest path, see e.g., [15]. 

We give experimental results to study how well the 
predictions match the ground-truth on simulated data, 
and how well it fares on real-data, compared to Cufflinks 
[4] and IsoLasso [9]; our method is very competitive, 
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providing in many cases better precision and recall. We 
expect our lead to be even greater once we incorporate 
paired-end read information. 

Methods 

We begin by recalling the basic notions of flow and of a 
min-cost flow problem, and refer to the excellent mono- 
graph [13] for further details. A flow network (or simply 
network) is a tuple N = (G, b, q), where G = (V, E) is a 
directed graph, b is a function assigning a capacity 
b uv e N to every arc (u, v) e E, and q is a function 
assigning an exogenous flow q v € N to every node v e V, 
such that Ylvev °lv ~ 0- We say that a function x assign- 
ing to every arc (u, v) e E a number x uv e N is a flow 
over the network N, if the following two conditions are 
satisfied: 

1. 0 < x uv < b uv , for every (u, v) e E, 

2. £ x vu — X! x uv = (\v for every v e V, 

ueV ueV ' 

In a min-cost flow problem, one is additionally given 

flow cost functions c uv (-), for every arc (u, v) e E, and is 

required to find a flow which minimizes: 

(u,v) € £ 

It is well-known that, under the assumption that all 
the flow cost functions c MV (-) are convex, a min-cost 
flow can be found in polynomial time [19] (see also [20] 
for the real-valued flow case). 

The reduction to a min-cost flow problem 

We will model Problem UTEC as a min-cost flow pro- 
blem, thus showing that it can be solved in polynomial 
time. First, we argue that it can be transformed into the 
following equivalent problem, where the input exon 
chaining graph has measured coverages only on arcs. 

Problem 2 (UTEJC) Given a splicing DAG G = (V, E) 
with coverage values cov{u, v), and cost functions f uv (-), 
for all (u, v) e E, the Unannotated Transcript Expres- 
sion Junction Cover problem is to find a tuple T> of 
paths from the sources of G to the sinks of G with an 
estimated expression level e(P) for each path P e V, 
which minimize 



E /« 

[u,v] e E 



cov{u, v) 



E < p ) 



Pe V: [u,v] e P 



Given an input G = (V, E) for Problem UTEC, we 
construct an input for Problem UTEJC by replacing 
every node v e V with two new nodes, v in and v out , and 



an arc 



Vout), with cov{v in 



cov(v), and 



f"i„v oa ,{ x ) ~ fv[ x )- Furthermore, for every arc (u, v) e E, 
we replace arc (u, v) with the arc (u ouB v in ), with the 
same coverage as (u, v). It is immediate that optimal 



solutions for G to Problem UTEC are in bijection with 
the optimal solutions for the transformed graph to Pro- 
blem UTEJC. 

To solve Problem UTEJC, we build an auxiliary offset 
network with convex costs of the form c uv (x) = f uv {x)- 
An optimal flow for this network will model the offsets 
(positive or negative) between the measured coverages 
of the exon chaining graph and their actual expression 
levels in an optimal solution. Then, we argue that a 
min-cost flow on this network naturally induces a solu- 
tion for the UTEJC problem. 

Onwards, we denote by Nj~(v) the set of out-neighbors 
of v in the directed graph G, that is, the set {w : (v, w) e 
E{G)}. Similarly, we denote by Nq(v) the set of in-neigh- 
bors of v in the directed graph G, that is, the set {u : {u, v) 
e E(G)}. When G is clear from the context, we will skip 
it as subscript. 

Given a splicing DAG G with coverage values cov(u, 
v), and cost functions f uv , for all (u, v) e E, we construct 
the offset network N* = (G* b, q) with cost function c, 
as follows (see Figure 1 for an example): 

1. we add to G* all nodes and edges of G, together 
with 

(a) a new source s 0 and a new sink t 0 with 
q So := q t0 := 0, 

(b) arcs (so, s), for every source s of G, and arcs 
(t, t 0 ) for every sink t of G, each with infinite 
capacity and null cost function, 

(c) arc (to, s 0 ) with infinite capacity and null cost 
function, 

(d) nodes s* and ?*, with initial exogenous flow 
q s , ■= q t , ■= 0; 

2. for every arc (u, v) e E(G), 

(a) b uv := oo, c uv (x) : =f uv (x), 

(b) we add the reverse arc (v, u) to G" with b vu :- 
cov(u, v), c vu (x) : = f uv (x); 

3. for every v e V(G), 

(a) its exogenous flow q v is zero, 

(bl if cov(v, u) — \^ cov(u, v) > 0 

we add arc (v, f) to G" where: 

ii we update *« : =f.. + E„ eN - M c<w ( u '")-E„ eN . w 

(c) if EneW(v) COy ^' ") " CW ( M ' V ^ < °- 
we add arc (s", v) to G* where: 

ii. we update *• «• + E U6 n-m cm i v - ") - E^n-m coi, (". »). 

The next lemma shows that there exists a min-cost 
flow x* on N*. 

Lemma 1 Given a digraph G with arc coverages cov 
(•,•), the offset network N" = (G*, b, q) constructed as 
above is a flow network, i.e., 12veV[G*) 4" = ®- 
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2 

Figure 1 Example of an offset network. An input G to Problem UTEJC (a), and the offset network G* (b); arcs are labeled with their capacity, 
unlabeled arcs having infinite capacity 



Proof: Since q v = 0, for all v e V (G*) \ {s* t*}, it 
remains to show that + <7t* = 0. Indeed, 

q s * + fa = ^ I ^ cov{u, v) — 2^, cov (v> u ) I 
cov(u, v) — cov(v, u) = 0 

(u,v)e£(G) (K,u)e£(G) 

□ 

From such a flow x*, we construct the function x on 
the edges G as follows. First, observe that for every arc 
(u, v) e E(G), at most one of x* m or x* u \s nonnull. 
Indeed, if this were not the case, then a flow y* which 
coincides with x*, except for y* uv := x* uv — mm(x* v , x^) 
and y* u := x* m — mm(xl v , x* m ), is also a flow on Af* and 
has a strictly smaller cost than x", contradicting the fact 
that x* is of minimum cost. Then, for each arc (u, v) e 
E(G) we set: 

x m :=cov{u, v) +x* uv -x* m . 
From a flow to a set of paths 

Theorem 1 below will argue that the above defined 
function x is a flow on G (points (1), (2)), whose arcs 
we consider to have unbounded capacities and whose 
nodes, apart from the sources and sinks, have exogenous 
flow 0. It is a well-known result from classical network 
flow theory that such a flow can be decomposed into 
paths, that is, there exist paths P\ , . . . , P t from the 
sources of G to the sinks of G, having weights w 1( . . . , 
w t , respectively, such that, for every (u, v) e E(G) we 
have 

X U y — ^ ^ W[. 

i : [u,v) belongs to P, 

Moreover, a decomposition of x into at most \E(G)\ paths 
always exists and can be found in time | V (G)| • E(G). 



Theorem 1 also shows that the paths of any decomposition 
of x are an optimal solution for G to Problem UTEJC 
(point (3)). 

Theorem 1 Given an optimal flow x" on G*, the func- 
tion x on G just constructed satisfies the properties, 
where S denotes the set of sources of G, and T denotes 
the set of sinks of G: 

(1) for alive V (G) \ (S U T), J2 ue N-(v) *W = EueNjftO x ™; 

(2) UseS 2~ZveN+(s) x w = 2~ZteT J2ueN-[t) x ut 

(3) any decomposition of x into paths attains the mini- 
mum of the objective function of Problem UTEJC, on 
input G. 

Proof: (1): Let v e V (G) \ (S U T); by the definition of 
x, we can write 



E 


x vu = (cov[u, v) + x* v — xl 


E 








E 


cov[u, v) — ^2 cov[v, u)+ 










E 




E * 






ugNJ(i<) 


E 


cov[u, v) — Y^ cov (. v - u ) + 


E 






,eN;(i,)UNj 



Observe that for all edges entering f" (exiting s*), their 
flow equals their capacity, as we have adjusted the exo- 
genous flow of t (of 5*) at point 3.(b)ii. (and 3.(c)ii.) in 
the construction of G*. We distinguish three cases. 

First, if J2ueN-{v)UN+ G (v) Kiv ~ JlueN-{v)UN* c {v)Ku > °, 

then J2ueN-{v)uNl(v) x uv ~ J2ueN-{v)uN+(v) x *w = x vt*- Since 
the flow X s ' uses the arc (v, if) with its maximum capa- 
city, we have that *«• - b «- = E„ 6N;( „) ""t"' ^"ZLn^m""^"' "\ 
which shows that J2ueN-(v) x uv ~ T, U eN+(v) x "u = 0, prov- 
ing the claim. 

Second, if Y2u€N~{v)un+ g [v) x *uv ~ J2ueN-{v)uN+{v) x tu < °> 
then ^,ueN~{v)uNi(v) x uv ~ HueN^^uNHv) x *u~ = ~ x **v Since 
the flow x* uses the arc (5*, v) with its maximum capacity, 
we have that *L = = £ u6 ^oo cm i v ' ") ~ Eknjm cov i u ' v \ 
which again proves the claim. 

Finally, if J2ueN-{i>)<JN+Xv) x *w ~ J2uen-{v)un*Xv) Ku = °< 
then, by construction there is no edge between v and if 
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or s~, implying, again by construction, that 
HueN-(v) cov (u, v) = EueNSM cov ( v > from which the 
claim follows. 
(2): From the definition of x, we have 



£ £ Xsv = £ £ ( cw ( s ' v "> + x ™ ~ x *^ (1) 

s6S»eN*(s) seS \veNJ(s) / 



= E £ cw ( s ' y ) + £ < - £ < w 

seS \vgN£(s) "gN£(s) "gN£(j) / 

By construction, since ^ s = 0 for all s e S, we 
have 4* + J2 veNi{s) x **> = £ veN£W *" + **^ Therefore, 

E^-to 4 " E„ 6Nas) < - - * - <. - - <s - E^ g(s , "K* 
Plugging this into (2), we obtain 



£ £ x$v - £ - 

seSueNJ(s) seS 
Similarly, 



(3) 



£ £ *"t = £( £ [cov(u, t)+< t -x*) J (4) 



J2 £ cw(«, o + £ <t - £ 4< (5) 



By construction, since q t = 0 for all teT, we have 
*s«t + EueNc(t) *ut = x *i 0 ~~ J2ueN-(t) x *tu- Therefore, 

Plugging this into (5), we prove the claim, since by (3) 
we get 



£ £ xm - £ x *u 

teT ueN~(t) 



tto X toSa 



teT 



£ £ 



(6) 



(3): Since any tuple of paths V = (Pi, Pi, ■ ■ ■ , Pk) from 
sources of G to sinks of G, induces a flow on G, where 
the exogenous flow of all nodes which are not sources 
nor sinks is zero, and any such flow can be split into 
paths from sources to sinks, the minimum value of 



£ /« 

[u,v)eE(G) 



cov(u, v) — £ gj 
P, e V.-(u,v) e P, 



(7) 



over all all ^-tuples of paths V = [Pi, Pi, Pk) 
from a source of G to a sink of G, and over all expres- 
sion levels e, for each P it is equal to min^ is a flow on G 

f uv (\cov(u, v) - )y,,|). Since any flow on G 



£ 



induces a flow on G*, and vice versa, the above is equal 
to 



mm 

z is a flow on G* 



Since 



argmin £ / u „(z u „) + fuvfau), 



z is a flow on G 



(8) 



(u,p)e£(G) J 



and from minimality, for all arcs [u, v) e E(G), at 
most one of z uv or z vu is non null, we have that x* also 
attains the minimum in (7), proving the theorem. □ 

In our implementation we use the min-cost flow 
engine available in the LEMON Graph Library [21]. If 
no engine for arbitrary convex cost functions is avail- 
able, or, more generally, if the cost functions themselves 
happen not to be convex, one can approximate any cost 
function with piecewise constant or convex cost func- 
tions: e.g., one can replace an arc (u, v) of capacity b uv , 
with \b uv \ arcs of capacity 1, such that first arc has cost 
f{\), and the z'th arc, i >1, has cost fit) - fii - 1) (this 
reduction is only pseudo-polynomial but reveals quite 
effective in practice), see [13] for further details. 

Decomposing the min-cost flow into few paths 

As already shown by the other tools, we are generally 
interested in parsimoniously explaining an RNA-seq 
experiment, that is, in finding, among the optimal solu- 
tions to Problem UTEC, one with a low number of 
paths. At a closer analysis it can be seen that any flow 
on a graph G = (V, E) can be decomposed into at most 
\E\ - \ V\ +2 paths [14]. However, decomposing a flow 
into a minimum number of paths is an NP-hard pro- 
blem in the strong sense, even when restricted to DAGs 
[14,15]. To overcome this limitation, various heuristics 
and approximations have been put forth, see, e.g., 
[14-17] and the references therein. The advantage of our 
method is that once we have obtained the optimal flow, 
we can apply any of these methods to split the flow into 
few paths. For simplicity, in this paper we employ the 
policy of repeatedly removing the heaviest path, see, e.g., 
[15]: until the network has null flow, we select a path 
from the sources to the sinks whose minimum flow on 
its edges is maximum, report it as transcript, and 
remove it from the flow network. 

Results and discussion 

We call our tool Traph (Transcripts in Graphs). We 
compared Traph to the most used isoform prediction 
tool Cufflinks [4] and with IsoLasso [9]. We also tried 
to include SLIDE [7] and CLIIQ [6], but we could not 
make the former work reliably, and for the latter the 
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publicly available version was not yet available. Full 
experiment data is available at [22]. 

We should point out from the start that Traph is not yet 
employing paired-end read information. Nonetheless, the 
experiments we report (both simulated and real) are with 
paired-end reads, Cufflinks and IsoLasso having access to 
the paired-end information. Moreover, since Traph is not 
yet employing existing gene annotation information, we 
ran Cufflinks and IsoLasso without annotation. As already 
mentioned, we use a least sum of squares model. We 
experimented in the current implementation with other 
cost functions, as mentioned in the introduction, 
f z (x) = x,fz(x) = x/^Jcov(z), or f z (x) = x/cov(z), respec- 
tively, for all nodes and edges z, but they currently give 
worse results. 

Matching criteria 

In order to match the predicted transcripts with the true 
transcripts, we take into account the DNA sequences 
but also the expression levels. For each gene, we con- 
struct a bipartite graph with the true transcripts 
T = (Ti, Ti, . . .) as nodes in one set of the bipartition, 
and the predicted transcripts V = {P\, P2, ■ ■ ) as nodes 
in the other set of the bipartition. Empty sequences with 
0 expression level were added so that both sets of the 
bipartition had an equal number of nodes. 

To define the costs of the edges of this bipartite 
graph, let us introduce (cf. Normalized Compression 
Distance [23]) the binary encoding of a true transcript T 
and its expression level e(T) with respect to a predicted 
transcript P with expression level e(P) 

code(T|P, ;) = y(j)y{d + l)editsencoded(T, P)y(f[e{T) - e(P))), (9) 

where y{x) = o'*"^*'' 1 lbin{x), bin{x) being the binary 
encoding of x >0, j is the index of P in the list of predicted 
transcripts, d is the unit cost (Levenshtein) edit distance of 
T and P, editsencoded(r, P) lists the edits and gaps 
between edits using 2-bit fixed code for edit type, 2-bit 
fixed code for substituted/inserted symbol, and y(x+l) for 
gap (run of identities) of length x, and f(x) is a bisection 
between {0, 1, -1, 2, -2, . . .} and {1, 2, 3, 4, 5, . . .} defined 
as f{x) = 2x for x > 0 andf(x) = 2{-x) + 1 otherwise. 

Then, the edge cost between nodes T, e T and Pj e V 
is defined as |code(r ; - | Pj, j)\ - \y(j)\. The closer to zero 
this number is, the better the match between true tran- 
script Ti, with true expression level e(r ; ) and predicted 
transcript Pj with predicted expression level e{Pj). The 
minimum weight perfect matching was then computed; 
this gives a one-to-one mapping between true and pre- 
dicted transcripts, therefore true transcripts can be 
ordered in the same order as they match predicted tran- 
scripts and code for the index, y(j), is no longer 
required. Let edit code length for an edge between T t 



and Pj be \y{d + 1) editsencodedfT/, Pj)\, where d is the 
edit distance. Let bitscore be edit code length divided by 
|y(|7/| + 1) editsencoded(7^ e)|; bitscore is asymmetric, 
and possibly greater than 1 if e would be a better match 
to T t than to Pi, but minimum weight perfect matching 
chose otherwise for global minimality. Let us also call 
relative expression level difference the ratio \e(Pj)-e(T^\le 
(Ti). Each matched node pair with relative expression 
level difference and (edit) bitscore under some given 
thresholds define a true positive event (TP). The other 
kind of nodes define false positive (FP) and false nega- 
tive (FN) events depending on which side of the bipar- 
tite graph they reside. Prediction efficiency based on 
precision, recall and F-measure is also employed in [6,9]. 

Simulated human data 

As in the case of the other tools, we deem that validating 
against simulated data is a prerequisite, since, in general, 
on real data, we do not have available ground-truth. We 
designed the following validation experiment, closely fol- 
lowing the approaches in [6,9]. We chose a set of genes 
at random, and looked up the corresponding annotated 
transcripts from the Ensembl database. Out of these 
genes, we selected only those having between 2 and 5 
transcripts. In all, we had 29 genes. For each transcript, 
we simulated reads with the RNASeqReadSimulator [24] . 
This simulator chooses an expression level at random 
from lognormal distribution with mean -4 and variance 
1. For each gene, it simulated paired-end reads, with frag- 
ment length mean 300 and standard deviation 20, as fol- 
lows: a transcript was chosen randomly using its 
expression level as distribution weight, while the position 
of the read within the transcript was chosen uniformly. 
As argued in the case of IsoLasso [9], various error mod- 
els can be incorporated in these steps, but we chose to 
compare the performance of the methods in neutral con- 
ditions. We mapped the reads with TopHat [25]: these 
read mapping results were given as input to the tested 
prediction software, and to a Python program which we 
wrote to construct the splicing graphs needed for Traph. 
Cufflinks and IsoLasso were ran with the default para- 
meters, because the parameters they offer relate to RNA- 
seq lab protocol, which was not simulated; we could not 
see changes to other parameters which could be relevant 
to the prediction. We use FPKM values as expression 
levels. 

We devised two experiment setups. In the first one, 
which we call single genes, 300, 000 paired-end reads 
were generated independently from the transcripts of 
each of the 29 genes, with the already assigned expression 
levels. They were independently given to TopHat for 
alignment, and these independent alignment results were 
fed to each tool. In the second, more realistic experiment, 
which we call batch mode, the transcripts and their 
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assigned expression levels were combined into one file, 
and from this file 29 * 300, 000 paired-end reads were 
simulated. All these reads were fed to TopHat for align- 
ment, and these combined alignment results were fed to 
the tools. The fragment length mean and standard devia- 
tion were passed to the tools, except for Cufflinks in batch 
mode, when it was able to infer them automatically. 

Table 1 and Figure 2 show selected validation results. 
The measures reported are precision = TP/(TP+FP), 
recall = TP/(TP+FN), and F-measure = 2 * precision * 
recall/(precision + recall). We selected to depict two 
relative expression level differences, 0.1 and 0.9, illus- 
trating opposite expression levels matching criteria. In 
the first, we require that the predicted expression levels 
be at most 10% different from the true ones, and in the 
second they can be at most 90% different from the true 
ones. Even though not yet employing paired-end infor- 
mation, Traph has better F-measure in three out of four 
scenarios. The lead of Traph is more visible in the batch 
mode scenario when the predicted expression levels can 
be at most 90% different from the true ones (Figure 2). 
This behavior might be due to the upward/downward 
coverage at the start/end of transcripts, which affects 
the average coverage Traph is assuming for source/sink 
nodes (exons). We expect to solve this by giving less 
weight to such exons in the fitness function. Notice also 
that out of the false positive transcripts reported by the 
tools, Cufflinks is reporting 32 transcripts which do not 
map to the areas of the 29 genes from where reads were 
simulated, IsoLasso is reporting 150 transcripts outside 
gene areas, while Traph is reporting only 15. 

Real human data 

We used the same real dataset from the IsoLasso paper 
[9], Caltech RNA-Seq track from the ENCODE project 
[GenBank:SRR065504], consisting of 75bp paired-end 
reads. Out of these reads, we picked the 2,406,339 
which mapped to human chromosome 2. We selected 
the 674 genes where all three tools made some predic- 
tion; these genes have 6075 annotated transcripts. 

First, we match the transcripts predicted by each tool 
with the annotated transcripts, employing the same 
minimum weight perfect matching method introduced 
before, this time without taking into account expression 



levels. A true positive is a match selected by the perfect 
matching with bitscore under 0.2. Traph predicted in 
total 2685 transcripts for these genes, out of which 244 
match the annotation. Cufflinks predicted in total 1796, 
out of which 349 match the annotation, while IsoLasso 
predicted 1362, out of which 343 match the annotation. 
We also include a histogram (Figure 3) of the lengths of 
the annotated transcripts of these genes, and of the ones 
reported by Traph, Cufflinks and IsoLasso. Here we 
round all transcript lengths to the nearest multiple of 
1000. We see that the distribution in the case of Traph 
is closer to the distribution in the case of the annotated 
transcripts; than the distributions for Cufflinks and 
IsoLasso. 

Third, we match the transcripts predicted by one soft- 
ware to the transcripts predicted by the other two, 
employing the same matching method. As in [9], we 
depict in Figure 4 a more detailed Venn diagram of the 
intersections between the sets of transcripts reported by 
the three tools. 

Running times 

On the real dataset, Cufflinks finished in 20 min, IsoLasso 
in 2 min, and Traph in 30 min. We should however stress 
that for solving the min-cost flow problem and for identi- 
fying the transcripts, Traph uses in fact 6 min, the rest of 
the time being spent by our graph creation tool, which is 
written in Python. We could not make such a detailed 
analysis in the case of the other two tools. The running 
time of our Python script is as well included in the last 
column of Table 1, where we listed the average running 
time per gene with simulated reads of each tool. 

Conclusions 

All tools for isoform identification and quantification 
use an explicit or implicit graph model. Resorting to 
such a representation, the main contribution of this 
paper consists in a novel, radically different method 
based on minimum-cost flows, an established problem, 
for which there exist polynomial-time algorithms and 
solvers. We implemented this method into our tool 
Traph. Even though Traph is not using paired-end 
information at this moment, Traph is competitive with 
state-of-the-art tools. 



Table 1 Performance of the three tools 





Precision 




Recall 


F-measure 


Avg. run time/gene 




(0.1, 0.2) 


(0.9, 0.2) 


(0.1, 0.2) 


(0.9, 0.2) 


(0.1, 0.2) 


(0.9, 0.2) 




IsoLasso 


0.0183 


0.2126 


0.0258 


0.3109 


0.0214 


0.2132 


25 s 


Cufflinks 


0.0545 


0.3623 


0.0482 


0.2931 


0.0507 


0.3060 


40 s 


Traph 


0.0862 


0.3729 


0.0689 


0.3988 


0.0747 


0.3665 


72 s 


Performance of the three tools 


under scrutiny in the 


single genes scenario; precision, recal 


and F-measure are 


computed for (relative expression level difference, 



bitscore) e {(0.1,0.2),(0.9,0.2)} 
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(a) single genes, relative expression difference thresh- 
old 0.1 
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(b) single genes, relative expression difference thresh- 
old 0.9 
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(d) batch mode, relative expression difference thresh- 
old 0.9 



Figure 2 Performance on simulated data. Performance of IsoLasso, Cufflinks, and Traph on simulated data: single genes scenario (a), (b); batch 
mode scenario (c), (d) 
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IsoLasso Cufflinks 

Figure 4 Results on real human data. Venn diagram of the 
intersections of the sets of reported transcripts 



This leads us to expect that once we incorporate paired- 
end read information, the performance of Traph will 
increase significantly. Note also that in the current imple- 
mentation, each exon equally contributes to the fitness 
function, independently of its length; we plan to include 
exon lengths in the fitness function in a future implemen- 
tation. We also plan to integrate existing gene annotation 
into a more refined construction of the splicing graph and 
into the fitness model. Our method is general enough to 
easily accommodate other biological assumptions. In order 
to evaluate the tools against real ground-truth data, we 
have started a process of acquiring long sequencing reads 
(PacBio) of the true isoforms of a gene. 
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